Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows
Abstract
MOTIVATION: Typical high-throughput genotyping techniques produce numerous missing calls that confound subsequent analyses, such as disease association studies. Common remedies for this problem include removing affected markers and/or samples or, otherwise, imputing the missing data. On small marker sets imputation is frequently based on a vote of the K-nearest-neighbor (KNN) haplotypes, but this technique is neither practical nor justifiable for large datasets.
RESULTS: We describe a data structure that supports efficient KNN queries over arbitrarily sized, sliding haplotype windows, and evaluate its use for genotype imputation. The performance of our method enables exhaustive exploration over all window sizes and known sites in large (150K, 8.3M) SNP panels. We also compare the accuracy and performance of our methods with competing imputation approaches.
AVAILABILITY: A free open source software package, NPUTE, is available at http://compgen.unc.edu/software, for non-commercial uses.
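The windowed KNN vote mentioned in the abstract can be illustrated with a brute-force sketch. This is not NPUTE's data structure or API, which the abstract does not detail. It recomputes mismatch counts for every query, which is exactly the cost an efficient sliding-window structure would avoid. The names `MISSING`, `mismatches`, `impute_site`, and `impute_all`, the `"?"` missing-call marker, and the representation of haplotypes as equal-length strings are all assumptions made for illustration.

```python
from collections import Counter

MISSING = "?"  # hypothetical marker for a missing call


def mismatches(a, b, lo, hi, skip):
    """Count differing calls between haplotypes a and b over sites [lo, hi),
    ignoring site `skip` and any site where either call is missing."""
    return sum(
        1
        for j in range(lo, hi)
        if j != skip and a[j] != MISSING and b[j] != MISSING and a[j] != b[j]
    )


def impute_site(haplotypes, sample, site, half_window, k=1):
    """Impute one missing call by a vote of the k nearest haplotypes,
    where distance is the mismatch count in a window centred on `site`."""
    n_sites = len(haplotypes[sample])
    lo = max(0, site - half_window)
    hi = min(n_sites, site + half_window + 1)
    target = haplotypes[sample]
    # Only haplotypes with a known call at `site` can vote.
    candidates = [
        (mismatches(target, other, lo, hi, site), idx)
        for idx, other in enumerate(haplotypes)
        if idx != sample and other[site] != MISSING
    ]
    if not candidates:
        return MISSING  # nothing to vote with; leave the call missing
    candidates.sort()
    votes = Counter(haplotypes[idx][site] for _, idx in candidates[:k])
    return votes.most_common(1)[0][0]


def impute_all(haplotypes, half_window, k=1):
    """Fill every missing call, always voting from the original (unimputed) data."""
    result = [list(h) for h in haplotypes]
    for s, hap in enumerate(haplotypes):
        for j, call in enumerate(hap):
            if call == MISSING:
                result[s][j] = impute_site(haplotypes, s, j, half_window, k)
    return ["".join(r) for r in result]


panel = ["00110", "00110", "11001", "00?10"]
print(impute_all(panel, half_window=2, k=1))
```

In this sketch each query costs time proportional to the number of haplotypes times the window width, so trying every window size at every site quickly becomes expensive on large panels; that cost is the motivation the abstract gives for a dedicated data structure.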
Similar resources
Improvement of missing genotype imputation through bi-directional parsing of large SNP panels Christine Sinoquet
Such difficult analyses as disease association studies, which aim at mapping genetic variants underlying complex human diseases, rely on high-throughput genotyping techniques. However, a shortcoming of these techniques is the generation of missing calls. Computational inference of missing data represents a challenging alternative to genotyping again the missing regions. In this paper, we prese...
LinkImpute: Fast and Accurate Genotype Imputation for Nonmodel Organisms
Obtaining genome-wide genotype data from a set of individuals is the first step in many genomic studies, including genome-wide association and genomic selection. All genotyping methods suffer from some level of missing data, and genotype imputation can be used to fill in the missing data and improve the power of downstream analyses. Model organisms like human and cattle benefit from high-qualit...
Iterative Two-Pass Algorithm for Missing Data Imputation in SNP Arrays
Though nowadays high-throughput genotyping techniques' quality improves, missing data still remains fairly common. Studies have shown that even a low percentage of missing SNPs is detrimental to the reliability of down-stream analyses such as SNP-disease association tests. This paper investigates the potentiality for improving the accuracy of an SNP inference method based on the algorithm forme...
Shape Detection with Nearest Neighbour Contour Fragments
We present a novel method for shape detection in natural scenes based on incomplete contour fragments and nearest neighbour search. In contrast to popular methods which employ sliding windows, chamfer matching and SVMs, we characterise each contour fragment by a local descriptor and perform a fast nearest-neighbour search to find similar fragments in the training set. Based on this idea, we sho...
Imputing missing genotypes with weighted k nearest neighbors.
Missing values are a common problem in genetic association studies concerned with single-nucleotide polymorphisms (SNPs). Since many statistical methods cannot handle missing values, such values need to be removed prior to the actual analysis. Considering only complete observations, however, often leads to an immense loss of information. Therefore, procedures are required that can be used to im...
Journal title:
- Bioinformatics
Volume 23, Issue 13
Pages: -
Publication date: 2007